ABSTRACT

The National Assessment of Educational Progress (NAEP) estimates the number of students who 
are more likely than other students to do certain problems right. NAEP reports the numbers, 
briefly describes the problems and says more students need to do these problems right. The press 
largely reports the news as presented. Reporters are not usually investigators, and news media are 
not refereed journals, so they do not conduct their own reviews to see if NAEP reports are right. 
More lead time for reporters, and having each NAEP report give a clear summary of the limitations 
of the data, would help improve coverage. However, the limitations of the data are significant, 
especially lack of student motivation and difficulty in describing the knowledge shown by 
students. These weaken our ability to draw conclusions from NAEP results. 

NAEP PROCEDURES TO ESTIMATE STUDENT SKILLS 
Number of Questions 

In NAEP each student has a few questions on each of several topic areas. For example in the 
1990 8th grade math tests, there were 5 topic areas, and an average of 12 questions per student per 
topic. Each student received one of 7 different test booklets, each of which covered all 5 topics. 
The booklets gave the students varying numbers of questions on the topics, as shown in the 
following table: 





                                            Questions per Student
                                            Average   Distribution
Numbers and Operations                        20      13, 18, 19, 20, 21, 23, 24
Measurement                                    9      7, 8, 8, 8, 9, 10, 13
Data Analysis, Statistics and Probability      8      6, 7, 8, 8, 9, 9, 10
Geometry                                      11      9, 10, 10, 11, 11, 12, 15
Algebra and Functions                         11      9, 10, 10, 11, 11, 12, 12



Source: Technical Report, pp. 22, 1*0, 247-52 [1] 

Some questions are easy, some moderate, some hard. Depending on the pattern of which questions a 
student gets right, NAEP estimates how likely it is that this student is very good, poor, or middling 
[more fully described, with references, in the appendix to this paper]. A student who answers all 
the problems right is likely to be a good student (though a perfect score might happen by guessing, 
or by the luck of knowing these specific problems, so NAEP recognizes there is some chance this 
student is only middling or poor). A student who misses some problems is considered by NAEP as 
likely to be a middling student. However, NAEP recognizes she might be a lucky poor student or an 
unlucky excellent student who misses problems for a host of reasons (has no incentive to try on 
this test, hasn't been taught these topics, works quickly and makes careless mistakes, works 
carefully on easy or interesting problems scattered around the test and doesn't finish, etc.). 
Thus when a reader sees low scores reported on a NAEP test (or most other tests), the reader must 
consider how likely it is that these scores measure knowledge of the topic area, versus motivation, 
luck, speed, etc. 

Masters [2] criticizes tests that confound ability in a field with speed or with whether the 
student has been taught the topic. Hambleton [3] suggests the need for several independent 
variables measuring these aspects. NAEP believes that its interpretations are correct, even though 
several dimensions are treated as one in the calculations [4]. 

NAEP ignores problems after the last one the student does in each 15 minute block of 
questions, but marks as wrong most of the questions that are skipped over without being answered: 
skipped questions get counted as right only about 1/x of the time, where x is the number of answer 
categories in the question [5]. Students however are not told that they should do the problems in 
order, or that they do not need to try to finish the test, so they may skip around or guess at hard 
questions at the end of the test, lowering the estimates of their skills. 
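
The scoring rule just described can be illustrated with a minimal sketch. It is a simplification, not NAEP's actual procedure: it treats "counted as right about 1/x of the time" as a fractional credit of 1/x, and the function name and data layout are hypothetical.

```python
from typing import List, Optional, Tuple


def block_credit(responses: List[Optional[bool]],
                 n_choices: List[int]) -> Tuple[float, int]:
    """Return (credit, questions_counted) for one block of questions.

    responses: True = right, False = wrong, None = no answer given.
    n_choices: number of answer categories x for each question.
    Assumption for illustration: a skipped question earns 1/x credit.
    """
    answered = [i for i, r in enumerate(responses) if r is not None]
    if not answered:
        return 0.0, 0
    last = answered[-1]  # questions after this one are ignored
    credit = 0.0
    for r, x in zip(responses[: last + 1], n_choices[: last + 1]):
        if r is None:
            credit += 1.0 / x  # skipped over before the last answer
        elif r:
            credit += 1.0
    return credit, last + 1


# A student answers questions 1, 3 and 4, skips 2, and never reaches 5 and 6.
credit, counted = block_credit([True, None, False, True, None, None], [4] * 6)
print(credit, counted)
```

In this example the skipped second question costs the student most of a point, while the two unreached questions at the end cost nothing, which is why skipping around can lower an estimate.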

The scale from poor to good on these tests is called "proficiency" or "grasp" in NAEP [6], and 
"ability" in most of the literature [7]. NAEP's terms are better ones for them, since NAEP tests 
measure how much the students have been taught, as well as innate ability. Some might say that an 
even better term would be "display," since the test measures what the student is willing to display 
under the test conditions. A student may be more proficient than he or she displays on this test. 

Motivation to Do Well on the Test 

Some evidence on this motivation factor is available from students' performance on tests 
required for high school graduation in certain states. We can compare the results when these tests 
were required for high school graduation, to field tests in previous years when there were no 
penalties for poor scores. 

The following data show that when a serious incentive is present (high school graduation) 
scores are usually higher. The exceptions are English and composition in Louisiana, and reading in 
Montgomery County, in all of which the scores were fairly high already in the field tests. The 
differences seem especially pronounced for blacks and Hispanics, to the small extent data are 
available. The change in incentives is combined with a change in student preparation, which will 
be discussed more below. 



Passing Grades as Percent of Students Taking the Test the First Time 
1991 90 89 88 87 86 85 84 83 82 81 80 79 78 77 

Louisiana (the first two lines are grade 11, others are grade 10) 

Science 89 87 71* 69* 

Social Studies 88 89 77* 70* 

Mathematics 83 82 77 71* 

English Language Arts 85 86 83 80* 

Written Composition 95 91 75 82* 



Maryland, Grade 9 

Writing 88 83 82 67 69 54* 51* 

Citizenship 75 76 71 73 66 59 42* 

(Statewide data on the field tests in math and reading are not available) 
Montgomery County, Maryland, Grade 9 

Mathematics 82 83 84 85 86 83 79 78 65* 

Blacks 61 63 65 4 67 65 57 53 34* 

Hispanics 62 61 67 68 (A 63 66 61 42* 

Reading 96 97 [illegible] 98 97 97 97 98 97 96 92 92 90 89* 

Blacks 93 [illegible] 93 95 9[illegible] 94 96 95 93 90 83 79 72 66* 

Hispanics 88 90 87 87 86 92 89 87 83 81 84 78* 

Citizenship [illegible] 84 81 83 81 75 62* 

Blacks 73 68 63 67 64 56 36* 

Hispanics 63 67 61 64 61 5[illegible] 42* 













* Field tests or other "no fault" tests. The other tests, not starred, are required for high 
school graduation. Only first testings of each group of students are shown, not re-testings. 
Source: State Departments of Education, and Montgomery County Public Schools [8] 

NAEP tests are penalty-free, like the "no fault" tests starred above. A junior high school teacher 
told me of watching students on standardized tests fill in box 1 on question 1, box 2 on question 
2, etc. in neat diagonals down the page, or drop the pencil randomly on the answer sheet. When she 
asked them why they didn't at least try to answer the questions, they asked "Why bother?" and she 
had no very good reason to offer. A junior and senior high school principal says the schools don't 
know how to get most students to take seriously any test for which there is no penalty. Both the 
teacher and the principal said 8th grade is especially not a good year to get students' 
cooperation, and February not a good month, so the trial state math test is doubly damned. 

The introductory script read to students in the math tests [9] does not offer any strong 
reason why students should try hard. It says, "the results will help government leaders, school 
administrators, and teachers" (not the favorite people of all students) and "will have an impact on 
schools and students," (vague?) so "we hope that you will do the best that you can." The script 
goes on to teach students how to use a scientific calculator (where the order of key strokes may be 
backwards from what students are used to) by 4 examples: 4 x 7.3 - 2, (80 - 14) x 6, 29, and 
pi. This is not what most educators would call a thorough lesson. Then there are some sample 
problems, including algebra, which is likely to frustrate students who have not studied algebra. 
Then there are personal background questions [10] which end with, "Does either your mother or your 
stepmother live at home with you? Does your mother or stepmother work at a job for pay?" and 
similar questions about "either your father or your stepfather." I understand researchers' 
interest in these questions, but the topics of divorce and stepparents are very touchy for many 8th 
graders and may leave students tense during the test itself. Then the third to last math 
background question is whether they agree or disagree that "mathematics is more for boys than for 
girls." Girls faced with this question may legitimately get angry at the presumption of posing 
such a question. 

Overall, motivation may not be high when students start the test. There is a special problem 
for 12th graders: a number of them are not taking any math [11], so many have little interest. 

Curriculum Alignment 

There is another factor present in the state graduation test results, with relevance for NAEP. 
As these tests became required, and teachers realized that the tests would actually be enforced as 
graduation requirements, teachers taught more carefully the material that would be tested. This 
accommodation shows up particularly in Maryland data, where the kinds of writing and legal 
knowledge that are tested were not necessarily taught throughout the state before the tests were 
required [12]. A high stakes test gives the test designers great power to control the curriculum 
[13]. 

Any national test that became a graduation requirement or job requirement would have a similar 
effect, standardizing the curricula, as the SAT and ACT now do for college prep courses. The country 
will have to think whether it wants this standardization. For example the 1990 NAEP math test in 
12th grade gives V)% of its weight to geometry and algebra. These go beyond simple applications 
like area equals length times width, to include secants of circles, supplementary angles, conic 
sections, imaginary numbers, and the quadratic formula [14]. For clarity of reporting, the 
objectives should be printed in the final report, so the press and public know what the students 
were expected to know. These math objectives follow recommendations of the National Council of 
Teachers of Mathematics, but are at odds with some minority views [15]. They are also at odds with 
skills listed in the last NAEP test on career development [16]. The goals of each test are set 
primarily by a group of college and public school teachers in the field, who have no special 
expertise on what the general population's needs will be in the 20th or 21st century. 

On the writing test, NAEP collects 7j-45 minute writing samples [17]. On the other hand 
Simmons [18] found that poor students needed to put 16 days into their writing (though not full 
time!), compared to 13.3 days for the best students and 11.9 days for average students. With this 
amount of work, the poor students rose to about the middle of the class, instead of being much 
lower, as they appear on timed tests. If the NAEP writing test became a high stakes test, teachers 
and students would have to practice 7f45 minute writing samples (with no time for reflection or 
re-writing). This writing drill would be at the expense of longer work, and also at the expense of 
speaking and listening skills, which already get little teaching, and yet are more central to 
"world class" workers than fast writing is [19]. 

Calculating Student Proficiency 

The appendix to this paper explains how NAEP reviews the pattern of answers to the test questions. 
It explains that each student has an unknown proficiency on each topic in a test. Therefore NAEP 
does not estimate one score, but 5 likely scores for each student on each topic tested. We can 
consider these 5 proficiencies as 5 shadow students, each with a different score. The shadow 
students are intended to be a representative sample of all students. 

In the 1990 math test, there were 5 topic areas as well as the 5 shadow students. Thus each 
of the 5 shadow students had 5 topical scores. These were averaged to create an average math score 
for each shadow student. NAEP reports show what fraction of shadow students are above or below 
various cut-offs, based on their average scores, or based on the 5 sub-scores. The percentages are 
of no great interest, since the scales are set to ensure that about 50% of students are above 250, 
17% are above 300, and 2% are above 350. The issue is what knowledge the students at each of 
these levels have, that others do not. 
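
The averaging and cut-off steps described above can be sketched as follows. This is only an illustration under stated assumptions: the scores are invented, only two shadow students per student are shown for brevity (NAEP uses 5), the function names are hypothetical, and NAEP's sampling weights are ignored.

```python
from statistics import mean

CUTOFFS = (250, 300, 350)


def average_scores(students):
    """Flatten students -> shadow students and average each one's topic scores.

    students: list of students; each student is a list of shadow students;
    each shadow student is a list of topic scores (5 topics in 1990 math).
    """
    return [mean(topics) for shadows in students for topics in shadows]


def fraction_at_or_above(averages, cutoff):
    """Share of shadow students whose average score reaches the cut-off."""
    return sum(a >= cutoff for a in averages) / len(averages)


# Invented data: 2 students, 2 shadow students each, 5 topic scores each.
students = [
    [[240, 260, 250, 255, 245], [300, 310, 290, 305, 295]],
    [[200, 210, 190, 205, 195], [350, 360, 340, 355, 365]],
]
averages = average_scores(students)
for cutoff in CUTOFFS:
    print(cutoff, fraction_at_or_above(averages, cutoff))
```

Note that the two shadow students for a single real student can land on opposite sides of a cut-off, which is the point of reporting fractions of shadow students rather than one score per student.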

Describing Problems 

NAEP publishes a curriculum simultaneously with administering the test. However they do not 
rank this curriculum from easy to hard. They wait until the test results are in, and then see what 
types of questions the students at various levels tended to get right and wrong [20]. NAEP then 
has groups of educators in the field try to describe the questions in terms of general kinds of 
knowledge (e.g. simple algebra). This procedure is hard, since there are overlapping concepts, 
questions worded in difficult English and questions surrounded by other harder questions. Then 
NAEP shows findings about how many students have each kind of knowledge. NAEP does not interview 
students, so it never knows why they get wrong the problems they do [21]. 

To show this process more specifically, we return to the 1990 math test. As mentioned above, 
each shadow student had 5 scores in different topics, which were averaged to get an overall math 
score. Then NAEP looked at shadow students who had average scores between 187.5 and 212.5, and 
found what percent of them got each problem right. Problems that at least 65% of these students 
got right (and that at least 100 students attempted or skipped) were considered fairly easy and 
were used to give examples of what most students can do who scored at 200 or above ("anchor" 
problems). A group primarily of math professors and teachers looked at these problems and 
described them as "simple additive reasoning and problem solving with whole numbers" [22]. They 
also wrote a longer description which mentioned that these students can multiply and divide with a 
calculator [23]. For 8th graders they released 5 of these level 200 problems. The 5 problems 
included knowing a common factor of 10 and 15 (division without a calculator) and solving (150 
÷ 3) + (6 x 2) (multiplication and division which the authors thought would be done without a 
calculator) [24], so we have to be concerned by the short title which implies these students know 
no multiplication or division. The longer descriptions are not included in the executive summary 
and were not used in news reports. They were not included in the Education Department's own 
article on the results [25] and are only available in the $28 full report, the technical report and 
the state reports. 

NAEP also looked at shadow students who had average scores between 237.5 and 262.5, and looked 
for problems that at least 65% got right, but which 30 percentage points fewer of the shadow 
students at 200 ± 12.5 got right. The same group primarily of math professors and teachers 
described these problems as "simple multiplicative reasoning and two-step problem solving" [26]. 
Their longer description mentions "factor" and "evaluation of simple expressions" in algebra [27]. 
Similar steps resulted in problems typical of levels 300 and 350, and descriptions of these levels. 
The short title of level 300 includes the words "simple algebra." They mean work more advanced 
than is done at level 250, but the brief titles wrongly imply that no algebra is done at level 250, 
just as they imply no multiplication is done at level 200. 
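
The anchor selection rule described in the last two paragraphs can be sketched in a few lines. This is a reading of the text, not NAEP's code: the function names and data are hypothetical, the 30 point difference is taken to mean "at least 30", and the requirement that at least 100 students attempted or skipped the problem is left out.

```python
def percent_right(results, low, high):
    """Percent right among shadow students with average score in [low, high].

    results: list of (average_score, got_it_right) pairs for one problem.
    Returns None when no shadow student falls in the band.
    """
    in_band = [right for score, right in results if low <= score <= high]
    if not in_band:
        return None
    return 100.0 * sum(in_band) / len(in_band)


def is_anchor(results, level, half_width=12.5, min_pct=65.0,
              min_gap=30.0, step=50, lowest_level=200):
    """True if the problem qualifies as an anchor problem for this level."""
    pct = percent_right(results, level - half_width, level + half_width)
    if pct is None or pct < min_pct:
        return False
    if level == lowest_level:
        return True  # no lower level to compare with
    lower = percent_right(results, level - step - half_width,
                          level - step + half_width)
    return lower is not None and pct - lower >= min_gap


# Invented problem: 80% right near 250, 40% right near 200.
problem = ([(250, True)] * 8 + [(250, False)] * 2
           + [(200, True)] * 4 + [(200, False)] * 6)
print(is_anchor(problem, 250), is_anchor(problem, 200))
```

The sketch makes the text's concern concrete: a problem that 40% of level 200 students get right is still classed as a level 250 problem, so the level titles understate what lower-scoring students can do.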

The present anchor items describe an average of 5 math scores. Each level may include 
students good in statistics but bad in algebra or vice versa. It would be more meaningful to 
describe anchor items for each subscale separately. 

The task of describing common patterns of what students can do is very hard. Often similar 
problems have very different success rates, and it is hard to see a reason. Neither NAEP nor the 
news reports highlight how ambiguous it is to try to say what a group of students can do, based on 
a few test questions. Right answers may often depend on the context of questions [28]. Several of 
the harder 8th grade anchor problems come from a single block of questions that students found hard 
(41% of problems in this block were answered right on average) [29]. The block started with a 
question on converting 150 minutes to hours, then had an algebra problem and a solid geometry 
problem. It had several other hard algebra and geometry problems, which may have frustrated 
students. Lord pointed out that the presence of hard problems hurts performance even on easy 
problems, since the hard problems take students' time away from the easy problems [30]. 

There are other examples of the problem of describing in words what students can do. In the 
1988 writing test, students were asked to write a persuasive letter. The assignment and the 
criteria were described quite differently in two reports on the same test [31]: 

Assignment: 

1/90 report: "adopt a point of view about whether or not funding for the space program should 
be reduced, and to write a letter to their senator, explaining their position." 

6/90 report: "take a stand on whether or not funding for the space program should be cut and 
write a persuasive letter that would convince a legislator of this stand" 

Criteria for minimal: 

1/90 report: take a point of view, not present reasons, no convincing evidence to sway 
senator's vote 

6/90 report: take a stand, briefly support it with one or two relevant reasons 

I have been told that the same test question and scoring criteria were being described in these two 
reports, one on 11th graders, the other on 12th graders [32]. The 6/90 version changes the tone of 
writing expected and hides a flaw in the test for Washington DC students, who had no senator (the 
"Dear Senator" seems to have been pre-printed on both answer sheets). The changing definition of 
"minimal" makes the results impossible to interpret. The definition of minimal is key, since half 
the 11th graders are at this level. Actually either definition should probably be called better 
than minimal, since lobbying groups recommend a simple brief statement of one's stand [33]. There 
is an air of unreality about the assignment anyway: 7{ minutes to convince a senator who has been 
the target of large professional lobbying campaigns? Nor did either report mention that the time 
available was 1\ minutes. 

The IAEP report on math and science gives even less information to judge what the different 
score results mean, with only one problem at each scale level in math and science [34]. 

The NAEP staff undoubtedly try to present clear explanations of what is known at each ability 
level. The task may be impossible, especially with students learning different aspects of writing, 
math, listening, lobbying, etc. in different schools. The report needs to mention these 
difficulties. 

With the 1991 math report, NAEP has made a large improvement in presenting information on math 
achievement. Up through 1988, reports showed how many students scored at and above various scale 
values, but did not mention that many other students also answered right each of the problems 
presented as typical of the scale value (since some students at lower levels also get each problem 
right) [35]. Now NAEP shows what percent of students get each problem right and the press reports 
it. 

Aside from the difficult descriptions of the levels, NAEP now presents the percent of students 
scoring at or above each level in meaningful ways. The report talks about students "demonstrating 
the ability" or "consistent success" or "solid grasp" [36]. These terms are fairly meaningful. 
Students at each level score about 70% on the problems typical of that level. The problems are 
independent, so students do not have a 70% chance of getting them all right, but on average they 
will get 70% of these problems right. Typically about 30% of the students one standard deviation 
lower get each of these problems right. So those lower students show a weak grasp, or inconsistent 
success. By contrast the report on the 1986 math test implied that a level was all or nothing: 
students knew the skills at a level or they did not, which led to the mistaken belief that the 
percent who could do a problem equalled the percent who were at that level [37]. One change that 
would help would be to avoid saying what students can do, based on the test, and say simply what 
they did. As noted above, it is very possible they can do more. 
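
The distinction drawn above between a 70% chance on each problem and the chance of getting every problem right is simple arithmetic. The sketch below assumes, as the text does, that the problems are independent; the choice of 5 problems is only an example.

```python
def chance_all_right(p: float, n: int) -> float:
    """Probability of getting all n independent problems right."""
    return p ** n


def expected_number_right(p: float, n: int) -> float:
    """Average number right out of n problems, each with success chance p."""
    return p * n


# A student at the level (p = 0.7) and one a standard deviation lower (p = 0.3),
# each facing 5 problems typical of the level.
for p in (0.7, 0.3):
    print(p, round(chance_all_right(p, 5), 4), expected_number_right(p, 5))
```

So even a student squarely at a level gets all 5 typical problems right only about one time in six, while still averaging 3.5 right out of 5.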

HOW THE PRESS REPORTS NAEP 

For this paper I reviewed 15 news accounts of the 1990 NAEP math test, and a few accounts of 
other tests [38]. The press reports are mostly very similar. The headlines usually say students 
are failing (9 out of 15). The text repeats some main numbers from the NAEP report (or from the 
SAT, ACT or norm-referenced test) and some quotes from education professionals who have ideas about 
what should be done. The ideas may change, "choice" in 1990, American Achievement tests and new 
math curricula in 1991, but the pattern of the stories is fairly constant. 

The result is not necessarily a consistent push for a needed reform, but a general belief that 
students, parents, teachers, textbooks and bureaucrats are no good, creating poor morale, 
especially among teachers, without the detailed information that would let someone know what 
improvements to consider. 

The newspapers generally do very direct reporting of NAEP results and the accompanying 
political statements. They report average scores, compare various groups, and quote the 
interpretive statements provided. "Where will the world's innovative discoveries, new solutions 
and creative products come from in the future? Does it matter?" was quoted from the IAEP report 
on math and science in the Boston Globe [39]. "How many times must this nation be reminded of its 
educational deficits?" was quoted from Secretary Cavazos in an AP story in the New York Times [40] 
on the same IAEP report. "Students are generally ill-equipped to cope confidently with the 
mathematical demands of today's society, such as the graphs that permeate the media and the 
regulations and procedures that underlie credit cards, discounts, taxation, insurance and benefit 
plans" quoted the Richmond Times-Dispatch from the 1991 math report [41]. 

The papers generally said most students were not ready for college (11 of 15) or technical 
jobs (8), and that 8th graders largely can't do fractions, decimals and percents (10 of 15). 

None of these papers covered any of the following on the 1990 math test: 

   Comments from alternative test proponents, such as the supporters of portfolios and 
      performance assessments 
   Comments from 8th grade teachers or students 
   Caveats such as lack of student motivation, average response levels of 80% (down to 62% 
      in Oklahoma), varying percentages of students omitted because they were in private 
      schools, small numbers of problems, unfamiliar scientific calculators, etc. 
   The issue of whether algebra and geometry should have 40% weight in the 8th grade, though 
      these are often not taught by then 
   The trial nature of the state testing, with its meaningfulness still in doubt 
   Reliability of NAEP descriptions of scores and student failure 

The reporters thought this was a fairly straightforward story, repeating widely known problems. 
They trust NAEP to have large sample sizes (mentioned in 7 of 15 stories), well spread around the 
country [42]. They have little knowledge of psychometric difficulties in interpreting what 
students know. 

Only one story that I saw had a substantially different interpretation from the NAEP report 
itself: the Wall Street Journal said, "States with traditional classroom approaches ranked 
highest in the study" [43]. The two reporters who wrote this article were able to find this 
information in the NAEP data and decided for themselves that it was a significant finding. 

The reporters are generally capable of covering more of the issues on testing and the math 
curriculum, even on small newspapers. However they seem to do thorough coverage primarily in 
feature stories, which may develop over time. Newsweek did straight reporting of this test. Time 
did not, but may work it into some more general story in the future [44]. The reporters 
occasionally cover stories on opposing viewpoints, such as a story in the Bismarck Tribune that 
extracurricular activities are predictive of later success in life, while school and college grades 
and ACT scores are not [45] and a story in the Atlanta Journal that US adults know more science 
than Japanese adults [46]. However the authors who release such reports generally lack the 
publicity resources of NAEP and get much less coverage. 

Time for Reporters to Understand the Issues 

In talking to reporters about the coverage of the 1991 math study, they complained that the 
materials were voluminous, and they did not have time to digest them [47]. The press received an 
advisory several days ahead that the report was coming out. It did not say so, but the report 
itself was available at noon the day before the press conference, under an embargo. For a 500 page 
report, that gave the press little time to understand it [48]. 

The reporter for the smallest paper I spoke to, the Bismarck Tribune, said she needed at least 
a week and preferably two, under embargo, in order to understand the report, get comments from 
teachers and make the story a local story. Even on the day of the press conference, officials in 
her state said they only had two copies of the report and refused to give her one [49]. Larger 
papers did get the report, but also wanted up to a week, also to understand the report and explain 
it better. NAEP worries about a longer period of embargo, saying the results were so sensitive 
they had to be held very tightly [50]. The press did not seem to consider the results so 
sensitive, since most states differed little anyway, and the overall results matched the 
conventional view that students are doing badly. They did not worry that someone might break the 
embargo. On this report, the Boston Globe did break the story a day early. Papers worried most 
about competition with TV news in their own markets, and they already lose that race with evening 
TV news, when NCES releases the information at a morning press conference. I think that a longer 
period under embargo would result in better coverage, and the occasional leaks would cause little 
harm. 

Statistical Presentation 

Several reporters mentioned that NAEP wanted them to use the "pantyhose" chart from page 16 
[51], to show in a statistically sound way which states outranked which others. It lists all 40 
jurisdictions tested, down the side, and lists them again across the top. For each jurisdiction 
one color shows which other areas are statistically the same. Another color highlights the states 
that scored better (or worse) to a statistically significant degree. The reporters thought such a 
chart was unreasonable for a newspaper, and wanted a simpler presentation. Some papers listed the 
states in alphabetical order, with scores and ranks. Some listed them in rank order. The Des 
Moines Register and New York Times listed the top and bottom states. Newsweek and the New York 
Times showed visually in bar graphs how much and how little the states differed. My impression is 
that the reporters and probably the public had little interest in the details of the ranking, aside 
from top, bottom, middle. Selden [52] suggested a graphical relation of socioeconomic status of 
states to test scores. The papers might carry such a graph, but the reporters and probably readers 
would still think the bottom states ought to be improved and the top states probably also, as Mr. 
Selden accepted in his article. I also think in his article Mr. Selden thought there would be more 
statistically significant differences among states than there turned out to be. 

The scale of proficiencies from 200 to 350 was not easy to understand. Newsweek was boldest, 
stressing which grade each score was equivalent to. They went beyond NAEP's careful statement that 
level 300 material is "introduced by the 7th grade" [53] to say "300 is roughly seventh-grade 
work." Some might disagree anyway on whether 2x + 3y + toe is introduced in 7th grade, or the 
inequality sign in 2x > 11 [54], though many of the other items are more clearly 7th grade work. 

Cause and Effect 

NAEP reports do not try to measure cause and effect, and newspapers generally preserve that 
line. The papers usually mentioned some correlates of the scores, especially TV time (14 of 15 
papers), race (12), 2-parent families (12; neither the papers nor the executive summary mentioned 
that these included stepparents), parents with college education, sex and suburb/city comparisons 
(9 each), attendance and poverty (7 each), home reading materials and homework (5 each). On that 
list, schools have some control over homework, but otherwise the aspects that schools can control 
were mentioned rarely: ability groups, school budgets, computers and workbooks were only mentioned 
by 2 papers each. This pattern reflects the stresses in the NAEP report. 

Nevertheless I would not encourage papers to give more play to correlations between scores and 
school actions, since the correlations may be spurious. One would first need to look at each 
effect while controlling for others (in a regression), and one would still have to deal with the 
ambiguity caused by lack of student motivation. For example perhaps ability groups result in lower 
test scores only because they reduce school loyalty and therefore reduce motivation on this kind of 
a voluntary test, while they may have no effect or a positive effect on actual learning. Or there 
may be other spurious connections between ability groups and test scores. There is certainly 
active research on the effectiveness of ability groups and other actions schools can take. NAEP is 
probably not the best place to study that kind of specific issue. The same weaknesses apply to the 
demographic issues that do get wide play. As a first step, NAEP can report on multivariate 
analyses to see what contribution each of the variables makes to math proficiency (or at least to 
test scores) when one controls for the other variables. Presenting such information is certainly 
feasible for newspapers. They can use concepts like: x points are added to a score by daily use 
of calculators, y points are subtracted for each hour of daily TV watching, etc. This multivariate 
approach, in combination with Selden's graph, might encourage people to see which states are doing 
better than their socio-economic status would suggest, so other states can copy what they are doing 
right. 

Splash 

NAEP reports editorialize more than many government press releases, in order to make a splash. 
The Labor Department says, "The nation's employment situation was little changed in June ... The
unemployment rate was 7.0 percent, little different from the May level of 6.9 percent" [55]. The
Department of Health and Human Services says, "mortality rates for ... hospitals [were] released
today ... consumers should use the information in consultation with their physicians. Mortality
rates ... do not necessarily represent the total performance of a hospital in caring for its
patients" [56].

On the other hand NAEP reports have such phrases as, "a large percentage of students 
approaching high school graduation lack a sense of the national heritage" [57]. "The
mathematical skills of our nation's children are generally insufficient to cope with either
on-the-job demands for problem solving or college expectations for mathematical literacy" [58]. 
Yet half of high school graduates do go on to college, and seem to cope; and the 20-24 year old 
unemployment rate seems to have little connection with state by state test scores, so people seem 
to cope at work too [59]: 
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NAEP says the US "is having difficulty maintaining its competitive edge in the global marketplace" 
[60], though our productivity is $24.29 per worker per hour, while Japan's is $12.76 [61], and
anyway in a service economy, most workers are not in danger of their jobs moving abroad. NAEP also 
complains that only 800 students get doctorates in math each year, down from the baby boom years of 
the 70s [62]. The relationship of global competitiveness and doctorates to some of the math 
questions covered in the report seems tenuous. Perhaps NAEP believes its data are less significant 
than the unemployment rate or the hospital death rates, so they have to color their language [63].
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IMPROVEMENTS IN NAEP REPORTS 

The NAEP reports would be clearer and would get clearer news coverage if they had a longer period
of release under embargo and if they had a three page summary, with one page on each of the 
following: 

Main findings 
Source of the data 
Limitations of the data 

The first two topics are covered in the present NAEP reports, but limitations are not, so I will 
list some of the items I have in mind: 

Most 8th grade students have not been taught some of the topics tested, such as
algebra, geometry and probability (totalling about t'M5% of the total score at grade 8)
[64]. NAEP does not seek to impose a national curriculum and therefore does not
recommend that schools try to improve scores by teaching topics they do not want to teach.

States also vary in students' educational backgrounds and family lives (such as the 
amount of quiet, stability, and encouragement the students have at home). Therefore some 
states have a harder time than others in teaching even the same material.

The results are biased downwards to some unknown extent, since the test is
voluntary, so students have no incentive to do their best. Differences among scores may
be caused by differences in students' willingness to devote energy to a voluntary test.

In NAEP and any test it is very hard to summarize in words what it is the students 
can do.

Response rates vary, with 80% responding nationally at 8th grade, or as low as 62% 
in Oklahoma [65] or 65% in 12th grade nationally [66]. Coverage rates are lower than 
response rates, considering the omission of private schools, non-English speaking 
students and special education students. 

The test scores have not been proven to have a relationship to success in later life
("predictive validity").

A summary of limitations like these would give the press and the public some orientation to the
data. Similarly the Department of Health and Human Services in its press release on hospital
deaths mentions caveats, in a way simple enough for reporters to cover (this example was suggested
by Jane Norman of the Des Moines Register) [67].

The backup sections in the report would include more detail on each of these sections, and:
Detailed objectives, i.e. the content intended to be covered by the test
Proficiency on each topic, among students who have been taught that topic 
Actual released tests, with accompanying scripts, percent of students choosing each 
answer or omitting the question, and a, b, c parameters (see appendix) 

Regression coefficients or other multivariate information, showing the effect on 
performance of each background variable or cluster of variables, holding the other 
variables constant; this would largely take the place of the univariate statistics now in 
NAEP reports. 

Non-participation rates, combining student and school non-participation rates, and 
also overall coverage, considering special education, language barriers and private 
schools 

It would also enrich the reports if NAEP could study students' attitudes and thinking 
processes as they take the tests, by observation and by interviews. This is a field where 
cognitive psychologists, child psychologists and anthropologists could be helpful [68]. 

The assignment of grade equivalents to NAEP scores seems very unwise, since curricula can and 
should vary: algebra may be taught in one school in 7th grade and may never be required in another 
school at any grade. To assign any grade equivalent is to assume a certain curriculum, which is 
not NAEP's role. 

Overall, considering the press coverage of NAEP, it is hard to see that the taxpayers are 
receiving information commensurate with the cost of the NAEP tests, and especially the state 
assessments: 

The tests do not cover the major issues generally agreed to be needed in work and 
life: teamwork, work attitudes, speaking and listening skills, etc. 

The content of the tests is not and perhaps cannot be summarized accurately 






The students lack motivation on NAEP tests, and state differences and changes over 
time are within a range that could be explained by differences in motivation 

A test with enough sticks or carrots to create motivation would move control of the 
curriculum to the test-writers 
Most countries do not even try such general tests in their high school examination systems. They 
tell teachers and students years in advance which topics will be tested, give them strong 
incentives, present students questions, usually with a fair amount of choice, and note whether the 
students display a serious understanding of the chosen problems, without trying to generalize to 
broad topics [69]. 

The US does not need tests to make schools accountable. As with doctors, judges, artists or 
mechanics, the difference between good and bad is not a score on a test, but is a complex matter, 
often different in the eyes of different beholders. Qualitative comparisons of schools, by various 
groups, such as newspapers, parents, students and businesses, would be richer and could focus on 
important differences of atmosphere, teaching ability and broad learning, more than test scores do. 
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APPENDIX 

Calculating Students' Scores

One of the purposes of this paper is to explain how NAEP estimates the likely "proficiency" of 
students in NAEP tests. NAEP gives each student several questions on a particular topic. As an
example we can look at the 7th booklet of the 1990 math test. It has 8 problems on data analysis,
statistics and probability [70]. As mentioned earlier, NAEP looks at which problems students get
right and wrong, to estimate their "proficiency" or "display." For example within these 8
problems, students who get the easiest 4 right and the hardest 4 wrong, are likely to be
distributed in their proficiency according to the following curve [71]:

At the far left or right, a few low or high students may accidentally get the easiest 4 right and
the other 4 wrong, but most students who have these 4 right and 4 wrong answers are likely to be
middle ability students. Students who get all 8 answers right are likely to be distributed
according to the following curve [72]:

However NAEP finds that curve hard to work with, so they use the following curve instead [73]:

This lowers the scores of top students to be closer to middle students. They do a similar change
at the bottom. Other students might get 4 problems right and 4 wrong, but out of order, say they
get wrong the 2 easiest and the 2 hardest. Such students are likely to be distributed in their
ability according to the following curve, very similar to the first curve:



I haven't yet labelled the scale of student abilities from left to right. I haven't explained how 
these likely distributions of students are figured out. I haven't explained how the distribution 
of the total population is figured out from these distributions for individual patterns of scores.
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I haven't labelled the scale, since it is arbitrary, in the same way that Fahrenheit and 
Centigrade temperature are. The zero can be anywhere, and the step can be any size [74]. NAEP 
takes the average of all students in 4th, 8th and 12th grades, and calls it 250.5. They take the
standard deviation of these students' proficiencies and call it 50 [75]. If the scores followed a
normal bell-shaped curve, 17% of all students would be below 200, 17% above 300 and 2.3% above 350.
In fact only about 10% are below 200, 22% are above 300 and about 2% are above 350 [76] taking all
grades together, so the distribution is slightly skewed. As an alternative scale, one could label
the scores with the average at 704 (Independence Day), and a standard deviation of 1. One would
still have about 2% of students above 706. With these labels the first two curves above would be:



The scale from 700 to 708 creates a subtle impression that there is not much difference in 
knowledge. The scale from 50 to 400 implies that people at 400 know twice as much as the people at 
200. On a vocabulary test it may be meaningful to know twice as many words (though there are 
rapidly diminishing returns, since 8,000 words account for 90% of written English, and knowing 
another 8,000 words only accounts for another 5%) [77]. On most tests there is no obviously
meaningful scale, and the reader must guard against thinking that 400 is twice as good as 200. 
NAEP tests are designed to distinguish students from each other, not to measure what they all know. 
NAEP reports themselves never make the mistake of interpreting the scale in terms of percentage 
differences, but they do not always say how arbitrary the scale is, and newspapers do say things 
like "Georgia ranked ... 30 percent from the bottom" [78].
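
The relabeling described above can be sketched in a few lines. This is only an illustration: the function names are hypothetical, and the second function assumes proficiencies follow an exact normal curve, which the text notes is only roughly true of the real data.

```python
import math

NAEP_MEAN, NAEP_SD = 250.5, 50.0   # NAEP's labels for the mean and standard deviation
ALT_MEAN, ALT_SD = 704.0, 1.0      # the alternative labels used in the text


def relabel(score, old_mean=NAEP_MEAN, old_sd=NAEP_SD,
            new_mean=ALT_MEAN, new_sd=ALT_SD):
    """Move a score from one arbitrary scale to another (a linear change of labels)."""
    return new_mean + (score - old_mean) / old_sd * new_sd


def normal_share_below(score, mean=NAEP_MEAN, sd=NAEP_SD):
    """Share of students below `score` IF proficiencies were exactly normal (assumption)."""
    return 0.5 * (1.0 + math.erf((score - mean) / (sd * math.sqrt(2.0))))


for s in (200, 300, 350):
    print(s, "->", round(relabel(s), 2),
          "; normal share above:", round(1.0 - normal_share_below(s), 3))
```

Because the change of labels is linear, the ordering of students and the shape of the distribution are untouched; only the numbers printed on the axis change. A score of 350 on NAEP's labels lands at about 706 on the alternative labels.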
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Probability of a Right Answer to a Specific Question 

We need to analyze each question before analyzing a whole pattern of answers to questions.
NAEP fits a mathematical curve to each question, showing the probability of a correct answer from 
students at different proficiencies. Here is the curve for a question on calculating the average 
age of 5 children: 13, 8, 6, 4, 4, with multiple choices: 4, 6, 7, 8, 9, 13, don't know (no
calculator available) [79]: 





Low students have a 22% chance of getting the right answer, so they're doing better than random 
guessing. High students have very good odds of getting the right answer. The problem is pretty 
good at distinguishing between low and high students, but not between low and very low or high and 
very high students, since the chance of a right answer is not very different once you get below 703 
or above 706. For making small distinctions between similar students, the problem is best between 
704 and 705, since the chance of a right answer improves fairly fast in that range. In fact the
problem has a steeper slope than most, perhaps because it is near the end of its test, so it is 
measuring both speed and knowledge. (It also measures agreement that medians are not averages, 
which may trouble some. NAEP accepts 7 but not 6 as an answer. In the 1990 objectives book, 
authors of a similar question thought they needed to say specifically arithmetic mean when they 
asked for an average [80], so it is not clear why the authors here thought the term average by
itself was unambiguous.) The equation of the curve is [81]: 



$$ p = c + \frac{1 - c}{1 + e^{-1.7a(\theta - b)}} $$
where p is the chance of a right answer, θ is a student's proficiency, e is the mathematical
constant 2.7183 and a, b, c are calculated by NAEP to fit the curve to real data as closely as
possible. For example on this question NAEP calculated c=.214 (the guessing level, a lower
asymptote), b=.lW (the difficulty), and a=1.368 (the steepness). For open ended questions c=0.
On the 1990 test of data analysis, statistics and probability, the difficulties range from -3.623
(easiest, the bar graph on p. 63 of the full report) to 1.183 (not released, but it involved
medians). The steepnesses ranged from .333 (gentlest slope, the graph on p. 63) to 1.983 (not
released, but it involved interpreting a circle graph) [82]. To give a sense of the range of curve
shapes, we show these three problems here:






                 a            b             c
Problem      Steepness    Difficulty    Guessing
3d             .333        -3.623         .175      easiest problem and gentlest slope
8e            1.983          .788         .216      steepest slope
7r             .860         1.183         .140      hardest problem



The problems are identified by their block (from 3 to 9; blocks 1 and 2 were background questions)
and by the question order within each block (from a to v, representing questions 1 to 23). 
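
To make the three curve shapes concrete, the sketch below evaluates them at a few proficiencies. It assumes the standard three-parameter logistic form, p = c + (1 - c) / (1 + e^(-1.7a(θ - b))), with θ on a scale whose average is 0 and whose standard deviation is 1; the function name and the choice of θ values are illustrative, not NAEP's.

```python
import math


def p_correct(theta, a, b, c):
    """Three-parameter logistic curve: chance of a right answer at proficiency theta."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


# (a, b, c) = (steepness, difficulty, guessing) for the three problems in the table
ITEMS = {
    "3d": (0.333, -3.623, 0.175),   # easiest problem, gentlest slope
    "8e": (1.983, 0.788, 0.216),    # steepest slope
    "7r": (0.860, 1.183, 0.140),    # hardest problem
}

for name, (a, b, c) in ITEMS.items():
    row = [round(p_correct(theta, a, b, c), 2) for theta in (-2, -1, 0, 1, 2)]
    print(name, row)
```

At θ = b each curve sits halfway between its guessing level c and 1, which is one way to read the "difficulty" parameter.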

From the actual test results, NAEP calculates θ, a, b and c, and there is room for error.
The testing literature has articles critiquing various ways of calculating these figures and
simulating the amount of error resulting. The following table from Mislevy illustrates the
problems, using simulated data, where it is possible to know what the true values are, unlike real
tests, where the true values are never known [83].





                  a                 b                 c
Question     True    Est.      True    Est.      True    Est.
1             1.1     1.3       -.4     -.3       .11     .17
2              .5      .4        .2      .6       .19     .24
3              .9     1.1      -1.3    -1.0       .26     .27
4             1.4     1.4      -1.0    -1.0       .17     .19
5             1.5     2.4       -.3     -.2       .13     .14
6             2.5     3.4      -1.1    -1.1       .18     .18



Source: Mislevy, 1986 

Besides errors in the parameters, curves may have different shapes from the equation assumed, with 
other bends and twists. There may be other important variables. NAEP recognizes that different 
curves may be appropriate for different states, but they derive one set of curves in order to
"maintain an equal measure for establishing comparisons among participating jurisdictions." They
recognize this may mean the measure fits the curriculum and answer patterns of some states more 
than others [84]. 

Probability of a Pattern of Answers 

Once NAEP has an equation for each problem, the probability p of getting it right can be
calculated for each θ. The probability of getting the problem wrong is 1 - p. For each θ,
problems are seen as independent [85], and each can have its p calculated, as p_1, p_2, etc. In
order to calculate the chances of getting two problems right we can multiply the two probabilities
(just as the chance of 2 heads is 1/2 x 1/2 = 1/4). The chance of getting the first four problems right,
and the next four wrong is:

$$ z = p_1 p_2 p_3 p_4 (1 - p_5)(1 - p_6)(1 - p_7)(1 - p_8) $$

Remember each of these p_i depends on its values of a, b, c and θ. A, b and c are fixed for each
problem. We can choose values of θ from low proficiency to high, calculate each p, then calculate z,
then graph the curve of z:
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The curve is low on the left, since low levels of proficiency mean the probability of getting any 
item right is fairly small, so the product z is small. For high proficiency, the probability of 
getting any item wrong is small, so again the product z is small. In the middle, the probabilities 
of right and wrong answers are not so small, and the product z rises to its maximum. This curve is
treated as the likely distribution of proficiencies for students who had this pattern of 4 right 
and 4 wrong answers on the test. 
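
The calculation of z can be sketched as follows. The eight sets of a, b, c values are hypothetical (NAEP's values for this booklet are not reproduced here), the curve for each problem is assumed to be the three-parameter logistic, and θ is taken on a scale with average 0 and standard deviation 1.

```python
import math


def p_correct(theta, a, b, c):
    """Chance of a right answer at proficiency theta (three-parameter logistic)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def pattern_likelihood(theta, items, answers):
    """Product z: p for each right answer, (1 - p) for each wrong one, at one theta."""
    z = 1.0
    for (a, b, c), right in zip(items, answers):
        p = p_correct(theta, a, b, c)
        z *= p if right else (1.0 - p)
    return z


# Hypothetical parameters for 8 problems, ordered easiest to hardest (not NAEP's values)
ITEMS = [(1.0, b, 0.2) for b in (-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0)]
ANSWERS = [True] * 4 + [False] * 4       # easiest 4 right, hardest 4 wrong

grid = [i / 10 for i in range(-40, 41)]  # theta from -4 to 4 in steps of 0.1
curve = [pattern_likelihood(t, ITEMS, ANSWERS) for t in grid]
peak = grid[curve.index(max(curve))]
print("z is largest near theta =", peak)
```

With these made-up parameters the curve is low at both ends and highest in the middle of the range, as the paragraph above describes. Changing `ANSWERS` to another pattern with 4 right and 4 wrong shows how the curve shifts when the right answers come "out of order."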

Combining Different Patterns into a Distribution for the Whole Population 

NAEP does not simply add these distributions for all students. They recreate the total 
population by using various equations. 

NAEP finds the mean of each distribution. Then NAEP tries to find one equation (a 
"regression") that calculates as many as possible of these means (for different students) as 
closely as possible. The equation takes into account background information on the student and the 
student's school (race, sex, parents' education, teaching practices, etc.) [86]. On average being 
black means fewer right answers and a lower distribution of proficiency, so does low parental
education. So may certain teaching practices (though NAEP does not report their findings on the 
effects of teaching styles). 

But of course not all students are at the mean proficiency of their group as calculated by the
equation, or even at the mean of their own personal curve. Students are scattered all over the
curve. They are simply considered more likely to be in the larger parts of the curve. NAEP picks
5 proficiency values for each student, randomly ("plausible values" or "imputations") [87]. These
are spread somewhat from the mean of their group, but not spread as much as the personal curves go. 
This is called a "posterior distribution," resulting from the "prior" assumption that students are 
more likely to be like other members of their groups than spread all over their own distributions. 

There is also a "prior" assumption that all groups of students have the same dispersion 
(standard deviation). Mislevy et al. [88] state that these techniques preserve the mean, standard
deviation, and shape of the distribution of proficiencies actually in the population.
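
The idea of drawing 5 values from a "posterior" can be illustrated with a much simplified sketch. It is not NAEP's procedure: here the "prior" is just a normal curve around a hypothetical group mean, the posterior is evaluated on a grid of θ values, and the item parameters, `group_mean`, `group_sd` and the random seed are all made up for illustration.

```python
import math
import random


def p_correct(theta, a, b, c):
    """Chance of a right answer at proficiency theta (three-parameter logistic)."""
    return c + (1.0 - c) / (1.0 + math.exp(-1.7 * a * (theta - b)))


def likelihood(theta, items, answers):
    """The student's personal curve: product of p (right) and 1 - p (wrong)."""
    z = 1.0
    for (a, b, c), right in zip(items, answers):
        p = p_correct(theta, a, b, c)
        z *= p if right else (1.0 - p)
    return z


def plausible_values(items, answers, group_mean, group_sd, n=5, seed=0):
    """Draw n values from (group prior) x (personal curve), evaluated on a grid."""
    grid = [i / 20 for i in range(-80, 81)]   # theta from -4 to 4
    weights = [
        math.exp(-0.5 * ((t - group_mean) / group_sd) ** 2) * likelihood(t, items, answers)
        for t in grid
    ]
    return random.Random(seed).choices(grid, weights=weights, k=n)


# Hypothetical parameters and answer pattern (easiest 4 right, hardest 4 wrong)
ITEMS = [(1.0, b, 0.2) for b in (-2.0, -1.5, -1.0, -0.5, 0.5, 1.0, 1.5, 2.0)]
ANSWERS = [True] * 4 + [False] * 4
print(plausible_values(ITEMS, ANSWERS, group_mean=-0.5, group_sd=1.0))
```

Multiplying by the group curve pulls the draws toward the group mean, which is why the values are spread less widely than the student's personal curve alone would suggest.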

How Reliable Are Short Tests? 

The reader may have had a qualm when we mentioned that NAEP was analyzing a test with only 6 
or 8 items. The qualm may have become anxiety when we pointed out that each question can 
distinguish detailed levels of proficiency in only a small stretch of the distribution. So at some 
levels, estimates of proficiency may be affected seriously by a single question. 

There is a formula to measure the amount of information a test provides at each level of the 
proficiency distribution [89]. This is the amount of information available to distinguish one 
proficiency from another. The formula is: 

$$ I_1 + I_2 + \dots + I_n $$

This formula adds up the information for all questions, where the I for each question is:

$$ I = \frac{2.89\,a^2 (1 - c)}{(c + 1/k)(1 + k)^2} $$
where $k = e^{-1.7a(\theta - b)}$. For example, for booklet 7 of the 1990 math test, with 8 problems, the top
of the following graph shows the amount of information available from each question, the total
information, and the information from four questions on graphs. The test includes 4 questions on
graphs, 2 on probability, and 1 each on averages and sample bias [90].

On the scale for amount of information [91], one means approximately the amount of information
from one good question. The total information in the middle of the proficiency distribution is 
equivalent to about 2-3 good questions at each point. However at 350 there is effectively only one 
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question, which is the question on average ages shown earlier. The bottom of the graph shows the 
probability of a correct answer on each question separately [92]. 
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The next graph shows the effect on a student's likely distribution if she misses that one question, 


The mean of the distribution drops from 400 to 320. 
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Student proficiency estimates depend seriously on a relatively few questions at the high end of the 
difficulty range, even when we consider all test booklets together, not just the questions posed to 
one student. If there are too few hard questions (or easy ones) to be a fully representative
sample of the domains of mathematics that they should represent, the test is weakened. For example 
the 1990 subscore on data analysis, probability and statistics has only 19 questions in all, and 
only 5 with b parameters at level 300 and above [93]. At best one can see these 19 problems as a 
well-stratified systematic sample of a certain domain of knowledge. This is a fairly small sample 
size, from a large domain. It is possible that NAEP is trying too much when it tries to measure a 
wide range in 5 abilities in 45 minutes. 

In the 1990 math test, the largest number of math questions were on "numbers and operations"
with a total of 46 questions [94], which is also a limited sample. As a further example of test
information, booklet 4 had 23 questions on numbers and operations [95]. The following graph shows
the total information available and the probability of a correct answer on each question. Again,
little information is available around level 350.
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